Overview

This session will give hands-on feel on conducting a bioinformatics analysis

We will cover the following sections ;

  1. Introduction to Binformatics
  2. Introduction to R
  3. RNA-sequencing analysis

Section 1: The question that we will answer

Cancer is a disease in which some of the body’s cells grow uncontrollably and spread to other parts of the body. Cancer is caused by certain changes to genes.

What are the differentially expressed genes between normal cells and cancer cells? What do we mean by differentially expressed?

  • For each gene, we want to know:
  • Which genes are much more active in cancer?
  • Which genes are less active in cancer?
  • Which genes don’t really change?
Cancer is a genetic disease
Cancer is a genetic disease
  • Cancer is caused by changes to genes that control the way our cell function.
  • There are about 20,000 genes in the human genome.
  • Image taken from National Cancer Institute

Key Point: We want to be clear what analysis are we conducting.


Section 2: Introduction to R and RStudio

What is R

  • R is an open-source language and environment for statistical computing and graphics, widely used by scientists.
  • R is both a computational language and environment for statistical computing, data visualization, data science and machine learning
  • RStudio is an integrated development environment for R and Python
  • Rstudio provides a graphic user interface for working with R
  • In this session, we will showcase an cloud based RStudio Server -
  • User can install R and Rstudio locally on their device

Introduction to RStudio interface

  • Panel towards the top left is the scrip
  • Basic math function

Addition

3 + 3
## [1] 6

Multiplication

3 * 3
## [1] 9

Storing variables in R

num1 <- 5
num2 = 10

num1 + num2
## [1] 15

A more practical example. Lets create a vector storing multiple values

#create vectors to hold plant heights from each sample
group1 <- c(8, 8, 9, 9, 9, 11, 12, 13, 13, 14)
group2 <- c(22, 23, 24, 24, 25, 26, 27, 20, 26, 28)

Lets get the sum

8 + 8 + 9 + 9 + 9 + 11 + 12 + 13 + 13 + 14
## [1] 106

Now calculate the average (mean)

(8 + 8 + 9 + 9 + 9 + 11 + 12 + 13 + 13 + 14) / 10
## [1] 10.6

Using R functions

Key Point: R has built in function that we can use to perform mathematical / statistical operations.

Using the sum function to calculate the sum

sum(group1)
## [1] 106

Using the mean function to calculate the average (mean)

mean(group1)
## [1] 10.6

Visuzalize data

Key Point: R has built in function to quickly visualize data.

Given our previous example of group1 and group2, we can display the data as a barplot.

# Calculate means
means <- c(mean(group1), mean(group2))

# Make bar plot
barplot(means, names.arg = c("Group 1", "Group 2"),
        col = c("skyblue", "salmon"),
        main = "Average Plant Height",
        ylab = "Mean Height")

Other forms of vizulaisation using R

Statistical test

Key Point: R has built in function to perform statistical test.

A t-test is a statistical test that is used to compare the means of two groups. It is often used in hypothesis testing to determine whether two groups are different from one another.

\[ t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} \]

where:
- \(\bar{x}_1, \bar{x}_2\) are the sample means
- \(s_1^2, s_2^2\) are the sample variances
- \(n_1, n_2\) are the sample sizes

Variance is simply the spread of the data

Click to show/hide code to generate the plot
df <- data.frame(
  value = c(group1, group2),
  group = c(rep("group1", length(group1)), rep("group2", length(group2)))
)

# Calculate means and variances
means <- tapply(df$value, df$group, mean)
vars <- tapply(df$value, df$group, var)

# Data for variance boxes
rects <- data.frame(
  group = c("group1", "group2"),
  xmin = c(0.7, 1.7),
  xmax = c(1.3, 2.3),
  ymin = means - vars/2,
  ymax = means + vars/2,
  mean = means,
  var = vars
)
ggplot(df, aes(x = group, y = value, color = group)) +
  geom_jitter(width = 0.1, size = 3, alpha = 0.7) +
  # Variance box
  geom_rect(data = rects, aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax), 
            fill = NA, color = "red", linetype = "dashed", inherit.aes = FALSE) +
  # Variance annotation
  geom_text(data = rects, aes(x = group, y = ymax + 10, 
                              label = paste0("Variance = ", round(var, 2))), 
            color = "red", size = 4, inherit.aes = FALSE) +
  # Mean point
  geom_point(data = rects, aes(x = group, y = mean), color = "black", size = 4, inherit.aes = FALSE) +
  labs(title = "Variance Visualized as a Box",
       y = "Value") +
  theme_bw()


Lets return to our example of group1 and group2. We want to test if the two values are statistically / significantly different. The t-test is performed using the t.test() function, which essentially tests for the difference in means of a variable between two groups.

t.test(group1, group2)
## 
##  Welch Two Sample t-test
## 
## data:  group1 and group2
## t = -13.26, df = 17.932, p-value = 1.048e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -16.10295 -11.69705
## sample estimates:
## mean of x mean of y 
##      10.6      24.5

t.test saves a lot of information: the difference in means estimate, confidence interval for the difference conf.int, the p-value p.value, etc.


Without using the build in function, it will look something like this ;

# Data
group1 <- c(8, 8, 9, 9, 9, 11, 12, 13, 13, 14)
group2 <- c(22, 23, 24, 24, 25, 26, 27, 20, 26, 28)

# Calculate means
mean1 <- mean(group1)
mean2 <- mean(group2)

# Calculate sample variances
var1 <- var(group1)  # s1^2
var2 <- var(group2)  # s2^2

# Sample sizes
n1 <- length(group1)
n2 <- length(group2)

# Standard error using variances
se <- sqrt((var1 / n1) + (var2 / n2))

# t-statistic
t_stat <- (mean1 - mean2) / se

# Degrees of freedom (Welch's)
df <- ( (var1/n1 + var2/n2)^2 ) / 
      ( ((var1/n1)^2)/(n1-1) + ((var2/n2)^2)/(n2-1) )

# p-value
p_value <- 2 * pt(-abs(t_stat), df)

# Print results
cat("p-value:", p_value, "\n")
## p-value: 1.048483e-10

Section 3: RNA-sequencing analysis

Example of gene expression quantification
Example of gene expression quantification
  • The example shows the expression level of one gene.
  • How can we repeat the analysis for 20,000 genes?
  • Where does the gene expression data comes from? Sequencing

Example data

  • RNA-seq data from 24 patients
    • 12 normal
    • 12 cancer
  • About 20,000 genes

Step 1: Load files in R

Load required package. By default, R comes with some basic tools (functions), but for tasks (like RNA-seq analysis), you need extra tools. These extra tools are called packages. Loading a package (with library()) means you are telling R, “Please get these extra tools ready so I can use them in my analysis.”

library(DESeq2)
library(EnhancedVolcano)
library(dplyr)
library(scales)
library(pheatmap)

Lets read the actual data from an experiment comparing a tumor and normal RNA-seq experiment.

First we read the in the sample data and look at the files.

df.meta <- read.csv("../output/sample_data.csv")

Lets have a look at the data.

df.meta
##          Run  group
## 1  SRR975577 normal
## 2  SRR975581 normal
## 3  SRR975584 normal
## 4  SRR975571 normal
## 5  SRR975570 normal
## 6  SRR975578 normal
## 7  SRR975569 normal
## 8  SRR975576 normal
## 9  SRR975575 normal
## 10 SRR975572 normal
## 11 SRR975573 normal
## 12 SRR975574 normal
## 13 SRR975559  tumor
## 14 SRR975555  tumor
## 15 SRR975565  tumor
## 16 SRR975564  tumor
## 17 SRR975562  tumor
## 18 SRR975552  tumor
## 19 SRR975567  tumor
## 20 SRR975566  tumor
## 21 SRR975558  tumor
## 22 SRR975554  tumor
## 23 SRR975551  tumor
## 24 SRR975568  tumor

Check the number of samples (rows) in the data

nrow(df.meta)
## [1] 24

Check the number of tumor and normal samples

table(df.meta$group)
## 
## normal  tumor 
##     12     12

Now we read into R the RNA-seq data. Lets look at the data first 3 rows and 5 columns. Each row is a gene and each column is a sample.

raw.counts <- read.csv("../output/raw_counts.csv", row.names = 1)
raw.counts[1:3, 1:5]
##        SRR975577 SRR975581 SRR975584 SRR975571 SRR975570
## TSPAN6      1309       674       981      2271       691
## TNMD          19         5        10        20         4
## DPM1         389       268       419       743       278

How is the data structured? - Each row is a gene and each column is a sample.
What are these numbers? - They are the number of reads that mapped to each gene.

Example of distibution of reads across genes.
Example of distibution of reads across genes.

Key Point: Understand the data structure: genes in rows, samples in columns.

Step 2: Example of a differential analysis

Lets give the example of one gene. We would like to perform a statistical test (t.test) to determine if the expression of the gene is different in the normal and tumor samples.

We know the first 12 samples are normal and the next 12 are tumors.

# Extract counts for the first gene as a numeric vector
gene_counts <- as.numeric(raw.counts[1, ])

# Create vectors for normal and tumor samples
normal_counts <- gene_counts[1:12]
tumor_counts <- gene_counts[13:24]

# Perform t-test
t_test_result <- t.test(normal_counts, tumor_counts)

# Print results
print(t_test_result)
## 
##  Welch Two Sample t-test
## 
## data:  normal_counts and tumor_counts
## t = -2.3454, df = 14.504, p-value = 0.03369
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2416.584  -111.916
## sample estimates:
## mean of x mean of y 
##  1380.083  2644.333

It may be helpful to also visualize the gene expression to understand what is going on.

# Assume df.meta$group contains "normal" and "tumor"
group_colors <- c(rep("skyblue", 12), rep("salmon", 12))

barplot(gene_counts, 
        col = group_colors,
        main = "Counts for Gene 1",
        ylab = "Counts")

Step 3: Performing RNA-seq analysis

We will have to do 20,000 t-tests. This is not practical. And there are other considerations that needs to be accounted for, for example normalization..

Lets now perform a differential expression analysis between tumor and normal samples.

One tools that we can use is DESeq2. This was designed by Michael Love, Wolfgang Huber and Simon Anders (bioinformatics group at EMBL Heidelberg, Germany).

Its starts by creating a DESeq2 object from the count data and the sample information.

dds <- DESeqDataSetFromMatrix(countData = raw.counts,
                              colData = df.meta,
                              design = ~ group)
## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors

We can examine what is inside the DESeqDataSet object.

dds
## class: DESeqDataSet 
## dim: 18674 24 
## metadata(1): version
## assays(1): counts
## rownames(18674): TSPAN6 TNMD ... BLACAT1 GIMAP1-GIMAP5
## rowData names(0):
## colnames(24): SRR975577 SRR975581 ... SRR975551 SRR975568
## colData names(2): Run group

Next we estimate the size factors and normalize the data. This is to account for the differences in sequencing depth between samples.

dds <- estimateSizeFactors(dds)
vsd <- vst(dds)

What the steps above does is that it divides each sample’s sequencing count data by the size factors (total sequencing reads) to normalize the data. We can visualize the comparison below ;

Click to show/hide code to generate the plot
# Calculate raw library sizes
library_sizes_raw <- colSums(raw.counts)

# Choose a target library size (median or mean)
target_libsize <- median(library_sizes_raw)
# target_libsize <- mean(library_sizes_raw) # you can use mean if you prefer

# Normalize each sample so all columns sum to target_libsize
norm_counts_demo <- sweep(raw.counts, 2, library_sizes_raw, "/") * target_libsize
library_sizes_norm <- colSums(norm_counts_demo)

# Prepare data for plotting
df_libsize <- data.frame(
  Sample = rep(colnames(raw.counts), 2),
  LibrarySize = c(library_sizes_raw, library_sizes_norm),
  Group = rep(df.meta$group, 2),
  Stage = rep(c("Raw counts", "Normalized"), each = ncol(raw.counts))
) %>% 
  mutate(Stage = factor(Stage, levels=c("Raw counts", "Normalized")))

library(ggplot2)
ggplot(df_libsize, aes(x = Sample, y = LibrarySize, fill = Group)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ Stage, ncol = 2, ) +
  labs(title = "Before and After Normalization",
       y = "Total Sequencing Read Counts",
       x = "Sample") +
  scale_fill_manual(values = c("normal" = "skyblue", "tumor" = "salmon")) +
  theme_bw() +
  theme(panel.grid = element_blank(),
        axis.text.x = element_blank()) +
  scale_y_continuous(label=comma) +
  NULL

Once the data is normalized, we can vizualize the samples on a Principal Component Analysis (PCA) plot.

df_pca <- plotPCA(vsd, intgroup="group", returnData = TRUE)
## using ntop=500 top features by variance
plotPCA(vsd, intgroup="group") + scale_color_manual(values=c("skyblue", "salmon"))
## using ntop=500 top features by variance

A PCA plot is like a 2-D map that shows you how similar or different your samples are, using most (or all) of the gene expression data at once.

Basis of plotting data on a 2-D plot
Basis of plotting data on a 2-D plot

Image taken from StatQuest

Step 4: Perform differential expression analysis

We use the function DESeq() to perform the analysis.

dds <- DESeq(dds)
## using pre-existing size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
## -- replacing outliers and refitting for 373 genes
## -- DESeq argument 'minReplicatesForReplace' = 7 
## -- original counts are preserved in counts(dds)
## estimating dispersions
## fitting model and testing

The key steps from the function above:

  • Size factor estimation: Taking into account the sequencing depth between samples.
  • Dispersion estimation:
    • Taking into account the variability of each gene.
    • Some genes are more variable than others.
    • Genes with higher expression levels are more variable (mean-dispersion relationship).
    • DESeq2 calculates an estimate of how variable (noisy) a gene is based on the sequencing counts across all samples.

Key Point: In summary, these steps are all about making sure DESeq2 understands both the average and the variability of each gene’s counts, so it can accurately tell which genes are truly different between your groups (e.g., tumor vs normal), and which are just noisy.

We can extract the results of DESeq2 comparison between the tumor vs normal group for each gene.

res <- results(dds)

The results are stored in the res object. We can view the top differentially expressed genes.

head(res[order(res$pvalue), ])
## log2 fold change (MLE): group tumor vs normal 
## Wald test p-value: group tumor vs normal 
## DataFrame with 6 rows and 6 columns
##         baseMean log2FoldChange     lfcSE      stat       pvalue        padj
##        <numeric>      <numeric> <numeric> <numeric>    <numeric>   <numeric>
## INHBA    453.212        6.26699  0.295228   21.2276 5.30617e-100 9.02050e-96
## STC2     250.030        5.03325  0.295610   17.0267  5.20987e-65 4.42839e-61
## ETV4     728.103        5.08567  0.306103   16.6143  5.49470e-62 3.11367e-58
## CPNE7    286.688        5.45681  0.331144   16.4787  5.22186e-61 2.21929e-57
## CDH3     655.717        5.85670  0.361411   16.2051  4.64210e-59 1.57831e-55
## CGREF1   164.795        3.94658  0.253266   15.5828  9.53524e-55 2.70165e-51
  • baseMean - Is the average (mean) normalized expression of a gene across all samples
  • log2FoldChange - The estimated difference in expression between the two groups, on a log2 scale.
    • How much more (or less) is this gene expressed in one group compared to the other?
    • A value of 1 means ‘twice as much’, -1 means ‘half as much’.
  • lfcSE - The standard error (uncertainty) of the log2 fold change estimate.
    • How sure are we about the log2 fold change? Smaller values mean more confidence.
  • stat - The test statistic (usually the Wald statistic), calculated as log2FoldChange divided by lfcSE.
    • How big is the difference compared to the noise? Bigger values mean a more convincing difference.
  • pvalue - The probability of seeing such a big difference (or bigger) just by chance, if there’s really no difference.
  • padj - The p-value adjusted for multiple testing.
    • A stricter version of the p-value that accounts for the fact that we’re testing thousands of genes.

Step 5: Vizualize the results

We can visualize the differential expression result of all the genes in a plot called the VolcanoPlot. We use the function EnhancedVolcano() to generate the plot.

EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'pvalue',
                title = 'Volcano plot')

Lets interpret the plot ;

  • Each dot is a gene
  • The x-axis shows how much the gene changes between groups (log2 fold change).
    • Genes far to the right are much higher in one group.
    • Genes far to the left are much lower in one group.
  • The y-axis shows how statistically significant that change is (the higher up, the more confident we are that the change is real and not just random noise).
    • Genes with higher expression levels are more variable (mean-dispersion relationship).

The plot looks like a volcano because most genes don’t change much (clustered in the middle), but a few genes shoot up on the sides—these are the most interesting ones!

We can visualize the differential expression result of selected significant genes as a heatmap. We use the function pheatmap() to generate the plot.